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Abstract 

We propose a new and effective means for designing stable and fast-folding 
polypeptide sequences using a cumulant expansion of the molecular partition 
function. This method is unique in that Tz, the "cumulant design temper- 
ature" entered as a parameter in the design process, is predicted also to be 
the optimal folding temperature. The method was tested using monte-carlo 
folding simulations of the designed sequences, at various folding temperatures 
Tp. (Folding simulations were run on a cubic lattice for computational con- 
venience, but the design process itself is lattice-independent.) Simulations 
confirmed that, over a wide range of Tz, all designed sequences folded rapidly 
when Tp ~ Tz- Additionally, highly thermostable model proteins were cre- 
ated simply by designing with high Tz- The mechanism proposed in these 
studies provides a plausible pathway for the evolutionary design of biologi- 
cally active proteins, which must fold and remain stable within a relatively 
narrow range of temperatures. 
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The functionality of a protein stems from the molecule's ability to assume one unique 
shape, the "native conformation," out of an untenable sea of possible conformations. Nature 
seems to have designed protein sequences capable of folding into specific conformations 
The problem of sequence design is the "inverse folding" problem: given a 3-dimensional 
target structure, predict a sequence of amino acids which will spontaneously fold to that 
structure, and remain stable with respect to that structure. 

A recent systematic approach to the sequence design problem utilizes the idea of stochas- 
tic optimization in sequence space ||: one should design sequences with a large energy gap 
between the native conformation and the set of misfolded, denatured states. The simplest 
implementation was based on the approximation that the free energy of the denatured state 
does not depend on sequence (only on amino acid composition), while the energy of the 
native state is sequence-dependent. Subsequent improvements to this algorithm accounted 
in a crude way for the dependence of the energy of unfolded conformations on sequence, thus 
relaxing the condition of fixed amino acid composition 0. The same approach with minor 
technical modifications was used in the recent work 0. No folding simulations were made 
to test the sequences designed in this work, however, making it impossible to evaluate the 
design algorithm of as well as the applicability of the heteropolymer model used there. 

The serious shortcoming of all previous sequence design methods is the conspicuous lack 
of reference to a folding temperature Tp. Stability and foldability are, however, sensitive 
functions of T F |§|9]||. Ideally, one would like to design sequences which are foldable within a 
preimposed, biologically relevant temperature range. Since the stability of the native state is 
determined by the difference in free energy between the unfolded state and the native state, 
a realistic treatment of the temperature dependence of the free energy of the denatured state 
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is paramount to any reasonable attempt of "rational" protein design. 

In this work, we report a method of rational sequence design which takes a target struc- 
ture and a desired optimal folding temperature Tz, and generates a sequence which is thermo- 
dynamically stable with respect to the target structure at a folding temperature Tp « Tz- 
(See Figure 1.) The method, which we call the "cumulant design method," is based on Fi£ 
a mean-field high temperature expansion of the single-molecule partition function, which 
allows us to estimate the free energy of the denatured state explicitly 

The energy of a model protein can be evaluated as 

N 

£ = ^ (i) 

i=l j<i 

where Ay = 1 if monomers i and j are within some specified distance range in three- 
dimensions, and otherwise. is the identity of the amino acid at position k, and B is 
the parameter set matrix, a symmetric energy matrix representing the pairwise attraction 
(or repulsion) of the various amino acids. We use the phenomenological parameter set of 



Miyazawa and Jernigan [ID|. 



The probability of finding a chain in any given state during the folding process is given 
by the canonical ensemble: 

p/r-n r 1 rp\ njll 1 g(x i+ i - Xj) exp(-E({f }, M)/r) 

lW,W ' j " Z({a},T) [Z) 

where the set {x} represents the positions of the N monomers, {a} represents the amino 

acid sequence, and g explicitly represents the constraints imposed by the chain. [[□]] (We 

absorb Boltzmann's constant Kb into T.) The conformation-space partition function is: 

Z({*},T)= £ n^(x m -f t )exp(- E({ ^ {a}) ) (3) 



It is through the partition function Z that properties of the unfolded state affect the stability 
of the native state. 

We can rewrite the partition function as an intergral over a continuous density of states: 

/oo 
e- E ' Tz p{E)dE (4) 
-oo 

where is the total number of possible conformations and p(E) is the normalized density 
of energy states. The cumulants (c n ) of p(E) are defined by 

log/ e-* E p(E)dE=Y, [ - J r-(cn). (5) 

J— 00 1 n\ 



71=1 

Solving for p(E) and substituting into yields 

jV-1 / 00 / jNn \ 

This is a high-temperature expansion of Z(T). 

It is useful that the cumulants of independent random variables add linearly. Thus if we 
know the cumulants of the energy probability distribution for a single contact, the energy 
cumulants of a system with Nc independent, identically distributed contacts is simply 

{{c 1 )...(c n )} = {Nc(c 1 ) s ...N c (c n ) s }. (7) 

Of course, (^) is useful only if we can estimate the cumulants of E. We make this estimate 
in the mean-field, by first calculating the set of simple moments (jJ> n ) B of the single contact 
energy E s : 

W,3« (8) 

In our mean-field approximation, monomers i and j are assumed, during random motion, to 
interact with a probability P^, determined by the chain connectivity. Since the correlation 



length in globules is small a good estimate of Pij for the globular, denatured state is 
the uniform distribution over all allowed contacts fl2| , [T3|j . (On a cubic lattice, this excludes 
contacts for which i — j is even.) 

We utilize the high temperature expansion (|6|) and the noted properties of cumulants to 
present the cumulant algorithm for sequence design: 

1. Begin with a target structure and a random sequence. The target structure will 
remain fixed, as the optimization occurs in sequence space. 

2. Compute the first n moments of the single-contact energy (eq. ^j. 

3. Compute the first n single-contact cumulants in terms of the first n moments, using 



a known recursion relation 1114 . 



4. Utilize (|7|) to approximate the cumulants of the total molecular energy. We used the 
maximally compact Nc, which was shown computationally to be the optimal choice. 

5. Estimate the partition function using ([]). Note that the temperature used in (|6p is our 
cumulant design temperature Tz, which is predicted to be the optimal folding temperature. 

6. Calculate the fitness parameter, the canonical probability of native state occupancy. 
For convenience, we actually maximize the quantity Pz which is monotonic with P: 

PziT) = -E N -J2 < c « > 0) 

n=l li - ± 

7. Make a random mutation in sequence space. 

8. Compute Pz for the mutated sequence, and accept or reject the mutation according to 
the Metroplis algorithm. One must use a convenient "sequence space selective temperature" 
T se i to run the simulation; this is unrelated to the cumulant design temperature T Z - 

9. Continue the monte-carlo search until equilibrium is reached, and Pz is (approxi- 



mately) maximized. 

The lack of an accurate "energy field" for real proteins currently prevents us from de- 
signing foldable proteins in the laboratory setting, so we must test our design hypothesis 
within the context of computational models. For simplicity we chose a standard cubic lattice 
model. Under such a model, each monomer is represented as a vertex on a cubic lattice, and 
two monomers "interact" if they are lattice neighbors but do not occupy successive positions 
on the chain. It is important to note that the cumulant design method itself is not beholden 
to any particular model; the cubic lattice is used merely to test the method. 

We designed a number of sequences for different target backbones and with various 
design temperatures T z , and tested the thermodynamic and kinetic properties (stability 
and folding times, respectively) of these sequences. For our target (native) structures, we 
used five random, maximally compact 36-mer backbones. We used several structures in order 
to make sure that any properties noted are not artifacts of a particular native conformation. 

Ten sequences were designed for each of five cumulant design temperatures (Tz = 
0.20, 0.28, 0.36, 0.44, 0.75), on each of the five random backbones, for a total of 250 designed 
sequences. The cumulant expansion was cut off at fifteen terms, enough to infer conver- 
gence of the series. Sequences were designed using the aforementioned Miyazawa-Jernigan 



parameter set |TIJ, and were "optimal" in the sense that they maximized Pz(T) within the 
approximations used. 

We ran 10 folding simulations, at various temperatures, on each of the designed se- 
quences. The runs were 1, 000, 000 monte-carlo (MC) steps in length. The folding algorithm 
was that of Hilhorst []15||, utilizing three standard moves: crankshaft, corner flip, and tail 



moves. One M C step consisted of one randomly chosen (sterically possible) move, whether 



or not that move was accepted according to the Metropolis criterion. 

Figure 2 demonstrates that the optimal folding temperature for a sequence is strongly 
correlated with the the cumulant design temperature Tz for that sequence. This is true over Fig. 2 
a factor of two in absolute temperature; within this range, the minimal folding time occurs 
for T F « T z . The relationship breaks down at high T z , for reasons that will be discussed 
subsequently. 

The designed sequences were also tested for stability. Starting from the folded state, 
sequences were subjected to folding simulations of 10, 000, 000 MC steps. Figure 3 shows 
clearly that sequences designed with high Tz were more thermostable than sequences de- 
signed with low T Z - (We acknowledge that longer runs would be necessary to confirm the 
stability data at very low Tp.) 

The sequences designed by the cumulant algorithm fold at least as rapidly as those de- 
signed for maximal gap (for comparison see e.g. the cumulant method could even 
design fast-folding 64-mers (median folding time 2.1 million MC steps). Most importantly, 
the cumulant design temperature T z proved a good predictor of the optimal folding tem- 
perature, allowing us to design sequences with specified (optimal) T F . 

Of particular interest is the newfound ability to design thermostable sequences. To this 
end it is instructive to study which features of sequences are responsible for optimal behavior 
at high folding temperatures Tp. One can keep only two terms in the cumulant expansion 
to approximate the optimized quantity given by (|9|): 
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P z n-E N + N C (B -^) (10) 
(a similar expression for the partition function of a disordered heteropolymer was obtained 
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in [|T^,|T^]) where Bo is the average interaction energy and a 2 is its variance. It is clear from 
( [TOP that at low temperature, sequences with minimal dispersion should be selected, perhaps 
at the expense of compromising low target state energy En, while at higher T this factor 
is less important and what is mostly optimized is En- Figure 4 shows that this is precisely 
what occurs. 

Although our algorithm is effective over a factor of two to three in absolute temperature, 
it is not generally possible to design sequences which are stable at temperatures outside 
of this range. It was effectively impossible to design sequences with T z < 0.20, due to 
divergence of the high-temperature expansion (|6]). At high Tz, design efficacy eroded due 
to limits on the interaction energies; no molecule can be stable at temperatures higher than 
the characteristic T of its strongest interactions. Additionally, our assumption of a compact 
denatured state becomes progressively less accurate as T increases. 

Any one protein can indeed fold only within a specific thermal environment. One ex- 
ample is the enzyme ribonucleotide reductase [|18||. The version of this enzyme present in 



the mesophilic bacterium Lactobacillus leichmannii is maximally active at 49°C, while its 
counterpart in the thermophile Thermus X-l demonstrates maximal activity at about 70°C. 
Importantly, the activity curves are narrow: at 49°, the thermophilic enzyme is less that 
30% active, and at 70°, the mesophilic homologue is completely inactive. Although it might 
be impossible to design a single protein which folds at all habitable temperatures, nature 
has been able to give each organism the proteins it needs to thrive in its own thermal 
environment. 

Our cumulant design method seems to give a reasonable estimate of the molecular par- 
tition function: we are able to design model proteins which are stable (and foldable!) at a 
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given temperature. We thus have a useful model for the thermodynamic properties of real 
proteins. 

This work was supported by the Hertz Foundation for the Applied Physical Sciences and 
the Packard Foundation. We are grateful for useful discussions with Sharad Ramanathan 
and Victor Abkevich. 

I. FIGURES 

Figure 1: Schematic of the sequence design process. The goal is to design a sequence 
which folds to a specific conformation, at a pre-specified design temperature T z . 

Figure 2: Median folding time vs. folding temperature (36-mer). 250 sequences were 
designed for five randomly-chosen, maximally compact backbones, at five cumulant design 
temperatures T z . Each sequence was folded 10 times at each of several temperatures T F , 
and folding times were sorted by T z and T F . For T z < 0.5, optimal folding occurs near 
Tp = T z , as predicted. (Folding runs were cut off at 1 million steps.) 

Figure 3: Stability vs. folding temperature (36-mer). The same 250 sequences designed 
for Figure 2 were tested for denaturation. Starting from the native state, the sequences were 
subjected 10,000,000 monte-carlo steps. Here, stability is defined as the percentage of MC 
time spent with > 39 of the 40 native contacts intact. 

Figure 4: Energy of the target conformation (left axis) and dispersion of interaction 
energies (right axis) for sequences designed at different T z 
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